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Abstract. In Adaptive Markov Chain Monte Carlo (AMCMC) simulation, classical estimators 
of asymptotic variances are inconsistent in general. In this work we establish that despite this 
inconsistency, confidence interval procedures based on these estimators remain consistent. We 
study two classes of confidence intervals, one based on the standard Gaussian limit theory, and the 
class of so-called fixed-b confidence intervals. We compare the two procedures by deriving upper 
bounds on their convergence rates. We establish that the rate of convergence of fixed-b variance 
estimators is at least log(n) / \fn, while the convergence rate of the classical procedure is typically 
of order n" 1 ^ 3 . We use simulation examples to illustrate the results. 



Throughout the paper we consider the following setting that covers standard MCMC and many 
AMCMC algorithms, n denotes a probability measure of interest on some measure space (X, B). 
{P$,0 € 0} is a family of Markov transition kernels on (X, B), for some measurable space (Q,A). 
We assume that each Pg admits it as unique invariant distribution, and that the map (x, 6) \— > 
P$(x, •) is (B x „4)-measurable. On some probability space (Q, J 7 , P) with a nondecreasing filtration 
{J^m n > 0}, we consider a J-~„-adapted random process {(X n ,0 n ), n > 0} with values in X x 
such that for any nonnegative function / : X — > R, and n > 1, 



We write E for the expectation operator wrt to P. We call the marginal sequence {X n , n > 0} an 
adaptive Markov chain. Notice that when there is no adaptation (that is 6 n = 9), {X n , n > 0} 
reduces to a standard Markov chain. AMCMC algorithms have recently gained popularity in Monte 
Carlo simulation due to their ability for producing efficient samplers with limited tuning from the 
user. For an introduction and literature review on AMCMC, see e.g. Andrieu and Thorns (2008). 

Let h : X — > R be some function of interest, and suppose that we wish to estimate 7r(/i) = f 
J x h(x)n(dx) (for instance, h(x) = x if we wish to estimate the mean of it). Under some fairly 
general conditions, it is known that 7r n (h) = n~ 1 Ylk=i M^fc) converges to n(h) and satisfies a 
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central limit theorem (Andrieu and Moulines (2006); Atchade and Fort (2010, 2012); Salesman and 
Vihola (2010)). It is also known that in most practical cases, as n — > oo, the bias of Tt n (h) satisfies 
E(-7r n (/i)) — ir(h) = o(n -1 / 2 ) and the variance is such that nVar (n n (h)) converges to a limit called 
the asymptotic variance of h. In these cases, assessing the Monte Carlo error of Tt n (h) boils down to 
estimating the asymptotic variance. For a Markov chain with transition kernel P, the asymptotic 
variance can be written as 

+oo 

4(M = E ( 2 ) 

e=-oo 

where for i > 0, je(P,h) d = J(h(x) - ir{h))P l h{x)ir{dx), and j-e(P,h) = je(P,h). A well estab- 
lished approach for estimating crp{h) is by lag- window obtained by taking a weighted average of 
the sample auto-covariances. More precisely, for < i < n — 1, set 

n-e 

Kn,e = n' 1 E ( h ( X i) ~ ™n{h)) (h(X j+ e) - n n (h)) , and j n -e = -y n , e . 
i=i 

Let {c n , ;n > 1} be an increasing sequence of integers such that c n f oo, and w : R — > M, an even 
weight function (w(— x) = w(x)). The lag-window estimator of &p(h) is 

n-i / u\ 

rl(h)= E W f 7n )fe - (3) 

The precision of the Monte Carlo estimate is then gauged by computing the Monte Carlo error 
\/r 2 (h)/n or equivalently the effective sample size n7 ni o/r 2 (h). Alternatively a confidence interval 
for ir(h) can be formed using Tt n (h) ± z a \jT\ (h)/n, where z a is the appropriate quantile of the 
standard normal distribution. The width of this confidence interval can be used as a stopping rule 
for the simulation (Jones et al. (2006)). All this is common practice in MCMC backed by the fact 
that for c n = o(n), and under some regularity conditions (e.g. geometric ergodicity and existence 
of (2 + e)-moment for h under 7r), T 2 (/i) converges in probability to crp(h) (Damerdji (1995); Flegal 
and Jones (2010); Atchade (2011)). 

Asymptotic variance estimation for AMCMC may behave differently. With {(X n ,9 n ), n > 0} 
as defined above, if n converges to a (possibly random) limit say, the asymptotic variance for 
h is typically 

lim nVar(7r n (/i))=E[<7£(/0], (4) 

where a 2 e (h) = op (h). The same lag- window estimate r^(/i) given in (3) can still be computed 
from the adaptive chain {X n , n > 0}. But as it turns out, if 6* is random, T^(/t) is inconsistent in 
general in estimating the right-hand-side of (4) (Atchade (2011)). More precisely, r^(/i) converges 
to the random limit cr^(/i), instead of the asymptotic variance E [f^ (/*)]• The following example 
illustrates this issue. 



Example 1. The example is adapted from the PhD thesis of David Hastie (2005, University of 
Bristol). Consider the toy density ir(x) = 0.51d(x), where D = [—/3 — 1, —0\ U [j3, /3 + 1], where 
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(3 = 3/4. Consider a Random Walk Metropolis (RWM) with proposal kernel Q$(x, •) taken as the 
density of the uniform U(x — 9,x + 9), assuming 9 > 2/3. It is well known that if 9 is too large 
or too small the resulting RWM kernel will mix poorly. An adaptive version of this algorithm will 
adaptively tune 9 so as to achieve an acceptance probability in stationarity of about 23%. This 
is a common strategy in AMCMC. It turns out that in this case the 23% acceptance probability 
in stationarity can be achieved at three (3) distinct solutions {$i, $2> $3/! say. For the resulting 
AMCMC sampler, 9 n can convergence to any one of these three solutions. For this example, r^(/i) 
converges to the random limit that takes values in {(^(/i), P$ 2 (h), 0"J 3 (/i)}, whereas the 

asymptotic variance is pia^^h) + p2(T 2 d2 {h) +P3c| 3 (/i), where pi = F(9± = (which depends in 
general on the initialization of the algorithm). Figure 1 (c) and (d) show two sample paths of r^(/i) 
with very different limits. However the adaptive chain {X n , n > 0} remains ergodic in the sense 
that fc n (h) converges to ir(h). 

(a) (b) 




50 100 150 200 250 ' 50 100 150 200 250 

Iterations Iterations 



Figure 1: Two sample paths of 9 n and from the toy example. Sample path 1 is (a) and (c). 

In both cases, h(x) = x. 



Despite its lack of consistency in estimating the asymptotic variance, we establish in this paper 
that the lag-window estimators r^(/i) can be used to derive asymptotically valid confidence interval 
for 7r(/i) in AMCMC simulation. The confidence interval is obtained by deriving the limiting 
distribution of the random variable 

_ dcf \/n (7r n (h) — 7r(h)) _ ^/e 

where h = h — ir(h). The key insight of the analysis is that nT^h) behaves precisely like the 
quadratic variation of the approximating martingale of Ylk=i M-^fc)- ^ s a resu lti \/ n ^nW 1S 
roughly the correct scaling statistic in the central limit theorem for X^fc=i ^(X/-), even in a general 
AMCMC setting. 
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The limiting behavior of T n depends on the choice of c n in computing T^(h). When c n = o(n), 
we show (Theorem 2.1) that T n has a standard Gaussian limit. When c n = n, we show (Theorem 
2.2) that J n converges in distribution to a standard Gaussian random variable scaled by an infinite 
sum of chi-squared. The case c n — n corresponds to the so-called fixed-b asymptotics well-known in 
Econometrics (Kiefer and Vogelsang (2005)). Theorems 2.1-2.2 are therefore extension to adaptive 
Markov chains of results that have been established for other type of stochastic models. See for 
instance Kiefer and Vogelsang (2005); Sun et al. (2008) for certain classes of stationary processes, 
and see Atchade and Cattaneo (2011) for Markov chains. These two results allow us to derive 
asymptotically valid confidence intervals for ir(h) in MCMC and AMCMC simulation. We compare 
these confidence interval procedures by simulation. We notice that the approach c n = o(n) is very 
sensitive to the actual choice of c n . In contrast, the case c n = n requires no tuning (since c n = n), 
produces slightly wider confidence intervals but with very good coverage probabilities. 

The simulation results suggest that the lag-window estimator converges faster when c„ = n, as 
opposed c n = o(n). Similar conclusion has been reported elsewhere in the literature, but there are 
very few rigorous results on the topic. Jansson (2004) studied stationary Gaussian moving average 
models and established that when c n = n, the rate of convergence of T n is rT x log(n). Sun et al. 
(2008) obtained the rate n _1 , under the main assumption that the underlying process is Gaussian 
and stationary. It is unlikely that these rates remain true without the Gaussian assumptions. We 
study in this paper the rate of convergence of T^(/i) when {X n , n > 0} is a Markov chain. For 
c n = o(n), we obtain that the convergence rate of T^(h) toward <J 2 (h) is of order n -1 / 3 . But when 
c n = n, we show that the rate of weak convergence of T^(/i) is at least n _1 / 2 log(ra). 

We organize the paper as follows. Section 2 contains the main results. We illustrate the results 
with two simulation examples presented in Section 3. Most of the proofs are postponed to Section 
5. 



1.1. Notation. Throughout the paper, we use the notations: / = / — 7r(/), 7r(/) = f f(x)ir(dx), 
Pf{x) = J f(y)P(x,dy), and P j f(x) = P{pj~ 1 f}{x), with P°f(x) = f{x). For V : X -»• [0,oo), 

def 

we define Ly as the space of all measurable real- valued functions / : X — >■ R s.t. ||/||y = 

sn Pxex\f( x )\/ v ( x ) < °°- 

For sequences {a n ,b n } of real nonnegative numbers, the notation a n < b n means that a n < cb n 

for all n, and for some constant c that does not depend on n. For a random sequence {X n }, we 

write X n = O p (a n ) if the sequence \X n \/a n is bounded in probability. We say that X n = o p (a n ) 

is X n /a n converges in probability to zero as n — > oo. The notation X n A X means that X n 

converges weakly to X. If X, Y are random variables, X d =' Y means that X and Y have the 

same distribution. For a random variable X, and q > 1, we use the notation \\X\\ q = E 1 ^ (\X\*). 

Throughout the paper c n denotes the cut-off point of the lag-window estimator and we assume 

without further mention that c n f oo, as n — > oo. Also we shall use g„\ Qn\ etc... as a generic 

notation for negligible random terms. 
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2. Consistency: statement of the results 

We rely on martingale approximation and martingale theory. Throughout this section h : X — > R 
is a fixed measurable function. For each 6 £ 6, we assume well-defined the functions gg and Pege, 
where 

9e{x) d =^P 3 e h{x), and Pggg(x) d = Pg(x,dz)gg(z), x G X. 

For each G 6, the function g# satisfies the so-called Poisson's equation 

gg(x) -P e g e (x) = h(x). (6) 

def 

For integer n > 1, set L> n = #6>„_iPQ - f0„_i50„_i(-Xra-i)- For p > 1, and integers n > A; > 1, let 



n d ^ f E^p (\Pe n ge n (X n )\ 2p ) , b n a ^ (|P^(X„) - Pg^ge^X^) , 
«n d = f Ki (\D n \*) , 6% ^ a fc _ x + £ 6, + 1 £ 



def 



j=lV(fc-c„ + l) 



j=lV(fc-c„+l) 



and iS = 



j=lV(fc-c„+l) 



/f 2 



In keeping the notations simple, we omit the dependence on p in these terms. We shall convene 
that if a > b, ■ = 0. The main regularity assumption is the following. 

Al For each £ <d, gg and Pggg are well defined, and there exists p > 1 such that, as n — ^ oo ; 



j n n 
«n H ^ Qfc + bk + 



fc=l 



i A k=i 



£4 = °(v^) 



(7) 



A2 There exists a random variable al, positive almost surely such that 

n n 

5> 2 / = oK), and n-^Dl-al, 



k=l 



k=l 



as n — > oo, where p is the same as in Al. 



A3 The function w : M — >■ [0, 1] /ias support [—1, 1], is even and satisfies: w(0) = 1, w(l) = 0. 



Remark 1. We give below in Section 3.1 some drift conditions under which Al holds. A2 depends 
in general on the behavior of 6 n which depends on the specific AMCMC considered. We give an 
example in Section 3. Assumption A3 holds for most kernels used in practice, such as the class of 
Bartlett kernels w(u) = (1 — |n| 9 )l(_ 11 )(n), for q > 1. 



6 YVES F. ATCHADE 

When c n = o(n), T n has a Gaussian limit. To describe this case, we introduce the sequence 



i 

pA2 



\ k — 1 / k — 1 

1 n 

-£ 6 * (a* + C + 4$) + « _ V (a n -i + + C) + ^«o- (8) 



n 

fe=i 



Theorem 2.1. Assume A1-A3 and lim n n 1 c n = 0. If\\m n r n = 0, and 

lim„ n~ pA2 Ylk=i | K fc^i 2 fc} = 0? ^ en o,s n ^ oo, T n (h) converges in probability to o\, and 
T n 4/V(0,l). 



Proof. See Section 5.2. □ 

When c n = n, the limit of T ra is a Gaussian distribution scaled by a sum of chi-squared random 
variables. Define the kernel p* : [0, 1] x [0, 1] ->■ R by 



p*(s,t) = w(t - s) - g(t) - g(s) + / g(u)du, 

Jo 



where g(t) = Jq w(t — u)du. Notice that is symmetric: p*(s, t) = p*(t, s). The kernel p± induces 
a compact operator </> \-t (s i->- /J p±(s,t)(f>(t)dt) on L 2 [0, 1] that we also denote p*. We will assume 
that the kernel is positive definite: for all n > 1, all ai,...,a n G R, and t\,...,t n G [0,1], 
X)iLi Sj=i a i a jP*{U, tj) > 0. The positive definiteness assumption of the kernel p± would imply 
that the operator p* has nonnegative eigenvalues. In which case we will denote {aj,j G 1} the 
(countable) set of positive eigenvalues of p* (each repeated according to its multiplicity), which we 
assume non-empty to avoid trivialities. 

Theorem 2.2. Assume A1-A3 and suppose that p* is positive definite. If c n = n, and lim n r„ = 
then 

Tn 4 Z ° , (9) 

where {Z$,Zi, i G /} are ii.d. iV(0, 1), and {o^, i £ 1} is the set of positive eigenvalues of p±. 
Proof. See Section 5.3. □ 



It is not very convenient to work with the random variable ^o/y Eiei a i^ii because the eigen- 
values of p+ are difficult to find in general. The next result gives an alternative representation of the 



distribution of -Zo/y X^iei a i^i that is more amenable to simulation. The proof is straightforward. 
Proposition 2.3. Let {B(t), < t < 1} denotes the standard Brownian motion. Set 



2 def -. 
X = 1 



f g(t)dt + 2 [ [ p*(s,t)dB{ 
o Jo Uo 



dB(t). 
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Then 

BO) ; (10) 

where {Zo,Zi, i £ 1} are i.i.d. N(0, 1), and i £ 1} is the set of positive eigenvalues of p±. 

Remark 2. In fact, x? = Xligi a i^i nas m & n y equivalent representations. It can be written as a 
double Ito- Wiener integral or a double Wiener integral. More precisely 

rl f'l f'l rl rl 

X 2 = l~ / g(t)dt+ / / p*(s,t)dB(t)dB(s) = / / p*(s,t)dB(t)dB(s) . 
Jo Jo Jo Jo Jo 

v ' V v ' 

double Ito- Wiener integral double Wiener integral 

The difference being that the double Ito- Wiener integral excludes the diagonal p+(t,t)dt = 1 — 
Jq g(t)dt. We prefer the representation given in Proposition 2.3 as a standard (iterated) stochastic 
integral. In fact, it can be easily shown that x 2 a l so has a representation as a double Wiener 
integral of w wrt to the Brownian bridge {B(t), < t < 1}: 

X 2 = [ [ w(s- t)dB(t)dB{t) . 
Jo Jo 

v ' 

double Wiener integral 

Remark 3. Theorem 2.2 requires the positivity of p*, whereas Theorem 2.1 does not require any 
positivity assumption. When w turns out to be positive (in the sense that the kernel k(s,t) = 
w(t — s) is positive definite), then p+ is also positive. This is easy to show and we omit details. 
This result applies for example to the Bartlett kernel given by w(u) = (1 — \u\)l^_i^{u). This 
function is the characteristic function of the distribution with density (1 — cos(x))/irx 2 , x£i, and 
by Bochner's theorem (s, t) i-> w(t — s) is positive definite on [0, 1]. 

Theorems 2.1-2.2 yield two asymptotically valid confidence procedures for ir(h). From Theorem 
2.1, we can form the classical confidence interval 



t n (h)±z 1 _ a/ J F ^, (11) 

V Th 

where T^(/i) is computed using c n = o(n), and Z\- a /2 is the (1 — a/2)-quantile of the standard 
normal distribution. We can also use Theorem 2.2 to propose the fixed-b confidence interval 



n n (h)±t 1 _ a/2] J r ^P-, (12) 

where T^(/i) is computed using c n = n, and ti_ a / 2 is the (1 — a/2)-quantile of the distribution of 
Zo/\Jj2ie\ a i^i- The quantiles t a are intractable in general. But using Proposition 2.3, we have 
%o I -y^Siel ai ^"i B(l)/ \[x 2 i so that these quantiles can be obtained by simulating the Brownian 
motion and approximating the iterated Ito integral /J J* p*(s,t)dB(s)dB(t). For illustration, we 
consider the following two kernels 
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(1) The Bartlett kernel given by w(u) = (1 — |ti|)l( 0) i)(it)- As pointed out in Remark (3), p+ in 
this case is known to be positive definite and is given by 

2 

p+(s,t) = - - s(l - s) - t(l -t) -\s - t\. 

(2) We also consider the kernel w(u) = (1 — u 2 )l( ,i)(u), for which 

p*(s,t) = 2(s-0.5)(t-0.5). 
Thus obviously, p* is positive definite. In fact, in this case x 2 = -^ 2 /6, where Z ~ N(0, 1). 



In Figure 2, we plot the cdfs of ^o/y X^iei f° r these two kernels in comparison with the 

standard Gaussian cdf. We also give some quantiles in Table 1. 




Figure 2: CDF of ^o/A/Siei ai ^i ■ ^ ne standard normal CDF is given as a reference. 



a = 10% a = 5% 
w(u) = 1 - |u| 3.796 4.784 
w(u) = l-u 2 15.590 31.520 

Table 1. The table reports t such that P(T > t) = a/2, where T = Z /Jj2 ie \ otiZf. 



3. Examples 

3.1. Application to adaptive Markov chains with geometric drift conditions. We will now 
illustrate how the assumptions stated above can be checked using drift conditions. We consider the 
following assumptions. 
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Bl For each 6 G 0, Pq has invariant distribution it. Uniformly for 9 G 0, there exist a 
measurable function V : X — > [1, oo), C £ B, v a probability measure on (X, B), b, e > and 
A G (0, 1) such that u(C) > 0, Pg(x, •) > ei/(-)l c (x) and 

P e V < XV + bt c . (13) 

B2 There exists rj G [0, 1/2), positive j n | 0, witft 7„ = O (n~ Q ), a > 1/2, and a /imte constant 
c such that for all n > 1, all (5 £ (0, 1 — rj), and all f G £y/3, wii/i |/|ys < 1, 

\P e J(X n ) - Pe n _J(X n )\ < c 7n V^(X n ), P - a.s. (14) 

Remark 4. Bl is the well known geometric drift condition. In general these drift conditions are 
difficult to check on specific examples, but there are known to hold for a number of target proba- 
bility distributions and algorithms. On the other hand, B2 is the so-called diminishing adaptation 
condition. This condition is in general easier to check and is known to hold for the Random Walk 
Metropolis (RMW) (Andrieu and Moulines (2006)) and the Metropolis adjusted Langevin algo- 
rithm (Atchade and Fort (2012)). Finally, we point out that in the case of a standard Markov 
chain, B2 trivially holds. 

In the present context Theorem 2.1-2.2 can be transposed as follows. 

Theorem 3.1. Assume A3, B1-B2, and take h G C v s, for 5 G [0,1/2 — rj). Suppose that there 
exists a random variable 9* with o 2 Pg (h) positive almost surely, such that 9 n ^4' 9*, as n — > oo. Set 

o~1 = f o~p g (h), and assume that \fn < c n . 

(1) If y/Cn = o (n 1 p a2 ^, as n — > oo, then T^(h) converges in probability to o~1, and T n 
7V(0,1). 

(2) If Cn = n, is positive definite, and a > then (9) holds. 

Remark 5. The assumption that 9 n converges almost surely to a limit depends on the specific 
adaptive algorithm under consideration. Many adaptive algorithms rely on stochastic approxima- 
tion. In this case, conditions under which 9 n converges can be found for instance in Andrieu and 
Moulines (2006); Atchade and Fort (2012) and the references therein. In practice, a simple plot of 
the sample path of 9 n (or some function of it) can give a good indication whether the assumption 
hold. 

Proof of Theorem 3.1. The real number p in Al can be taken as p = 2(5+rf\ ' anc ^ we n °ti ce that 
p > 1, and 2p(S + rj) = 1. It is also well known that under Bl gg and Pggg are well-defined, 
supeee \ge\v? + \Pe9e\vP < °°> and 

supE(V(X n )) < oo. (15) 

n>0 

These results can be found for instance in Andrieu and Moulines (2006). This implies that 

K n < 1, and a n < 1. 
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(14) and (15) imply that b n < j n . We deduce that 



a n + C n 1 ^ a k + ^ b k + 



1 fc=l \ fc=l 



fe=i 



by assumption. Thus Al holds. Recall that D n = ge n _ 1 (X n ) — Po n l g0 n l (X n -i), and <r|(/i) = 
J Tr{dx) J P$(x, dz) {ge{z) — Pgge(x)) 2 . Therefore, by the law of large numbers, we have the almost 
sure convergence to zero of the sequence n _1 Y7k=i (j^k ~ u 1u-\ ^) ( see e '^' Atchade (2011) Propo- 
sition 3.3 for a proof). It follows that rT x Y^k=\ ®\ converges almost surely to cr 2 {h). Thus clearly 
A2 holds. 

We also check that 6® < yfa, and 6™ < 1 + Zt=i h % EfcLi J" a % 1 + c n"°- Jt follows that 



1 + A-ot 1 li c l-a 

-n= i± ^ + ^ + i ^V- (16) 

(1) The conditions c n = o ^n 2 ^ 1 p a2 - > ^ , and a > 1/2 imply that lim n n~ pA2 Efc=i | K fc^i 2 fc| = 
and lim n r ra = 0. Therefore, the conclusions of Theorem 2.1 hold. 

(2) If c n = n, the condition a > ^ implies that lim n r n = 0, and the conclusions of Theorem 
2.2 hold. 

□ 

3.2. A logistic regression example. We assume that 

yi ~ B {pix'iP)) , i = l,...,n, 

where yi G {0, 1}, Xi G M rf is a vector of covariate, and j3 G M rf is the vector of parameter. S(p) 
denotes the Bernoulli distribution with parameter p G (0,1), and p(x) = is the cdf of the 

logistic distribution. Assume a Gaussian prior N(0, s 2 Id) for fj, with s = 20. The posterior 
distribution of f5 then becomes 

vr(/3|A)oc e W e -^l /3 l 2 . 

To illustrate the ideas above, we will consider two commonly used algorithms to sample from tt: 
a plain Random Walk Metropolis (RWM) with Gaussian proposal and the adaptive version of the 
same algorithm presented in Atchade and Fort (2010) (Algorithm 3.1). This algorithm adaptively 
and simultaneously estimates the covariance matrix of the target distribution and implements the 
0.23 acceptance rule. It is known that for this problem, both algorithms satisfy Bl-2 (see e.g. 
Atchade (2011) Section 5.2). 

As a simulation example I test the model with the Heart dataset which has n = 217 cases and 
d = 14 covariates. I first run the adaptive chain for 10 6 iterations and takes the sample posterior 
mean of (3 as the true posterior mean. I repeat the confidence intervals (95% confidence intervals) 
K = 200 times to estimate coverage probability and half-length. Each sampler is run for 30, 000 
iterations. The result is summarized in Figure 3. For the case c„ = o(n), I use c„ = n s for different 
values of S G (0, 1). 
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We see from the results that using c n = n gives very good coverage, but slightly wider intervals. 
The interval width is significantly wider for the quadratic kernel w(u) = 1 — u 2 , which is somewhat 
expected given the very fat tail of the limiting distribution (Figure 2). In contrast, the result show 
that in the setting c n = o(n) careful tuning of c n is necessary to obtain good coverage. As expected, 
the results in this case are similar for both kernels. 



Coverage Probability. Non-Adaptive chain 
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Probability. Adaptive chain 




Interval half-length. Adaptive chain 
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Figure 3: Coverage probability and confidence interval half-length for parameter j3% and for 
different values of 5 using c n = n s . The vertical lines correspond to c n = n. The dashed lines 



correspond to the kernel w(u) 



1 



u 



3.3. A random effect Poisson regression example. We now consider a random effect Poisson 
regression example taken from Gelman et al. (2004). For e = 1, . . . , N e and p = 1, . . . , N p , the 
variables y ep are conditionally independent given ({/? p }, {e ep }) € R n p x R n ' xN p, with conditional 
distribution 

y ep ~ V (n ep e» +a ^ +£ ^ , e = l,...,iV e , p = l,...,N p , 

where V(\) is the Poisson distribution with parameter A. In the above display, {n ep } is a deter- 
ministic baseline covariate, and fj, € K, {ot e } E M Ne are parameters. We assume that the random 
effects {ftp} and {s ep } are independent with prior distributions 

Pp^Nfroj), ee P ~ N(0,a 2 £ ), e = l,...,N e , p = l,...,N p , 

for some parameters <x^ > 0, a 2 > 0. We assume a diffuse prior for (//,, a, cr|, of) (a 2 > 0,cr 2 > 0). 
For identifiability, we assume that Y,k=i a k = 0- Let 9 = fact, P,e,a 2 ,aj) G R3+7V e -i+(7V P +i)7V e> 



12 



YVES F. ATCHADE 



The posterior distribution of 6 given V 
n(9\V) oc exp < ^ Ve,p{^ + a e + /3 P + e 



(Hep-, n ep ) takes the form 



e,p) Tl e p6 



e 



N e N p 



los 



2 1 2 1^2 

c e,p P p=l 



This posterior distribution is typical of probability distributions for which AMCMC are useful. 
A possible MCMC strategy to sample from this posterior is a Metropolis-within-Gibbs. One 
can update a 2 and a 2 exactly as inverse-Gamma IG(0.5(3A p — 2), 0.5 Ylep e e,p) an d IG(0.5(A r p — 
2), 0.5 ^2pZ± P p ) respectively. The parameter [i can be updated exactly as the log of the Gamma 
distribution G(^ ep y ejP , ^ ep re ejP e ae+/3p+ee ' p ). The rest of the parameter a%, 02, (3 P and e e , P can 
be updated one at the time, using one step of a RWM with a Gaussian proposal N(x,a 2 ) with 
a = e~ 2 . One can compare this Metropolis-within-Gibbs sampler with its adaptive version where 
the scaling parameters a of the RWM steps are adaptively tuned using the 23% acceptance rule. 
It is unknown whether these algorithms satisfies the assumptions above. 

I set A^e = 3 and N p = 27 (thus 6 £ M 89 ). I generate an artificial dataset with 02, /i, cr 2 , a 2 a) = 
(0.35,0.15, —1.0,0.1,0.3), and run a preliminary MCMC sampler for 2 millions (2 x 10 6 ) iterations 
and compute its sample mean. This gives a\ = 0.3948 that I take as J aiTT{9\D)d9. We wish to 
construct 95% confidence intervals for ol\. I run each algorithm for 60,000 iterations and discard 
the first 10, 000 iterations as burn-in. This is repeated K = 200 times to estimate the properties 
of the confidence intervals. The asymptotic variance are estimated using only the Bartlett kernel. 
The results are reported in Figure 4 and yield similar conclusions as the previous example. 
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Figure 4: Coverage probability and confidence interval half-length for parameter ax and for 
different values of 5 using c n = n 5 . The vertical lines correspond to c n = n. 
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4. Rate of convergence 

The simulation results presented above suggest that T n (h) has better convergence properties 
when c n = n. We consider this issue here in the case where {X n , n > 0} is a Markov chain 
with transition kernel P and invariant distribution tv, so that the asymptotic variance o~ 2 (h), is 
non-random, and given by (2). The initial distribution of the chain is arbitrary. We assume that 
o~p(h) > and without any loss of generality, we take ir(h) = and crp(h) = 1; otherwise, simply 
replace h by (h — n{h))/yj ap(h). We further simplify the analysis by assuming that P satisfies the 
following geometric ergodicity assumption 

GE There exists a measurable function V : X — > [l,oo) such that 

supE(V(X n )) <oo, (17) 

n>0 

and for all f3 £ (0, 1], 

\\P n (x,-) -tt(-)\\ v p < Cp n vP(x), n>0, xGX. (18) 



When c n = o(n), we have the following. 



Theorem 4.1. Suppose that A3 and (GE) hold. Let 5 G [0,1/4), and h G C v s . As n — > oo ; if 

c n = o(n), then 



E V2 



<- + 



Proof. See Section 5.5. 



(19) 
□ 



Remark 6. Theorem 4.1 implies that the rate of convergence of T^(/i) is of order n~ 1//3 , using c n = 
n 1 / 3 . This rate is known to be tight for kernels with characteristic exponent 1. The characteristic 
exponent of w is the largest number r > 1 such that lim^o M~ r (1 — w(x)) G (0, oo). Our analysis 
does not make use of this concept. If w has characteristic exponent r, it is known (see e.g. Parzen 
(1957) Theorem 5A-5B) that the rate of convergence of T^(h) is + 



^ , using c n 



n 



(l+2r)- 



for certain classes of stationary processes. The Bartlett kernel has characteristic exponent 1, and 
the kernel w(u) = 1 — u 2 has characteristic exponent 2. 

We also consider the rate of weak convergence of T n (h) towards the limiting distribution of 
Theorem 2.2, when c n = n. Denote Lip 1 (M) the set of all bounded Lipschitz functions / : R — > R 
such that 

I ,, def \f(x) - f(y)\ 
l^' Li P = Sl *P \ x _ v \ ^ L 

For P, Q two probability measures on R, we define 

di(P,Q)^ sup 
/eLi Pl (iE 



fdP - / fdQ 
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di(P, Q) is the Wasserstein metric between P, Q. It is well known (see e.g. Dudley (2002) Section 
11.8, Problem 1) that in the case of M, di can be written as 

/oo 
\P(A u )-Q(A u )\du, A u = (-oo,u]. 
-oo 

Thus an upper bound on di (P n , P) gives a Berry- Esseen-type bound on the rate of weak convergence 
of P n to P. In a slight abuse of notation, if X, Y are random variables, and X ~ P and Y ~ Q, we 
shall also write di(X, Y) to mean di(P, Q). 

Theorem 4.2. Suppose that A3 and (GE) hold. Suppose also that I is finite. Let 5 G [0, 1/4), and 
h £ C v s . If c n = n, then 

di(r 2 n (h), X 2 )<^, asn^oo, (20) 
V n 

where \ 2 = J2iei ai ^i > {^»> * G /} are i.i.d. N(0, 1), and {cti, i G /} is the set of positive eigenvalues 
of p*. 

Proof. See Section 5.5. □ 

For the proof we use the Bergstrom method, well known in studying convergence rates in the CLT 
for partial sums (see e.g. Dedecker and Rio (2008)). The log(n) term in (20) is an artefact of the 
method. We conjecture that in general the convergence rate of T^(/i) is n~ 1 / 2 . If we further assume 
that {h(X n ), n > 0} is a Gaussian process with a martingale structure, then the convergence rate 
in (20) can actually be improved to log(n)/n; we omit the details. Quadratic forms has also been 
studied elsewhere in the literature. In a series of papers, F. Gotze and co-authors have studied 
the convergence rate of quadratic forms and obtained the optimal rate of n _1 (see e.g. Gotze 
and Tikhomirov (2005) and references therein). But their setting is different as they assume i.i.d. 
sequence and consider quadratic forms for which the weights do not depend with n. 



Remark 7. The assumption that I is finite is mostly technical and it seems plausible that this result 
continues to hold without this assumption. There are known kernels for which I is finite. For 
example I is finite for the kernel w(u) = (1 — u 2 )l(_i^(u). This is because in this case, p*(s,t) = 
2 (s — ^) (t — Thus /?* admits a unique positive eigenvalue a\ = 1/6 with eigenfunction 4>i(t) = 



t-\. 



5. Proofs 

5.1. Martingale approximation. Much of the analysis relies on the ability to approximate a 
partial sum of the form J2k=i a kh{X] t ) by the martingale Ylk=i a kDk- This is well known and we 
skip some of the details and refer the reader for instance to Andrieu and Moulines (2006). It is 
easy to see from property (1) of the adaptive chain that E [D^Tk-i) = 0. Therefore, under Al, 
{(-Dfc, J^k), k > 1} is a 2p-integrable martingale-difference. Such martingale satisfy Burkeholder's 
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inequality (Hall and Heyde (1980) Theorem 2.10) that we will use repetitively: for any sequence of 
real numbers {a k , 1 < k < n}, and for any q > 1, 



k=i 



< 



J2\\a k D k \\l™ 



.k=i 



i 

qA2 



(21) 



The following martingale approximation for partial sums plays an important role in the sequel. 
Lemma 5.1. Under Al, and for any sequence of real numbers {a k , < k < n}, 

n n 

J2^kh(X k ) = J2 a kD k + 4 0) , (22) 

where the remainder is given by 

def 



k=i 



k=i 



k=l 



e^ 0) = a Po ge (X ) - a n Pe n ge n (X n ) + J2(a k - afc-i)iV 1 $0 fc _ 1 (.X fc -i) 

n 

+ ^2 a k ( p e k 9e k (Xk) - Pe^ge^AXk)) , 



k=i 



and satisfies 



2p 



< \ao\a + \a n \a n + ^ \a k - afc_i|a fe _i + ^ \a k \b k . 



k=i 



k=i 



Combined with (21), this lemma implies that 



^2a k h(X k ) 



k=l 



£4* (23) 



< 1 + \a n \a n + ^2 \a k - a fc _i|a fc _i + ^ \a k \b k + 

2p k=l k=l 

We now show that a similar martingale approximation holds for quadratic forms. This extends 
Lemma 2.1 of Atchade and Cattaneo (2011) which considered the case where {X n , n > 0} is a 
Markov chain. The proof is postponed to the Appendix. 

Lemma 5.2. Assume Al and A3. Consider the quadratic form 



k=ij=i v n ' 



k-j 



Then we have 

^ n n 

fc=i j=i 

for remainders e„ \ e« ^ for which EVPdeWjP) < r^ , where p is as in Al, and 



D k Dj + eW + eW, 



1 n 

= — E 8fe ( flfc+ 5 nl + 6 nl) > and 
n k=l 

(\ n pA2\ 1 " 1 H 

^2 E{ Kfe ( l+<5 2)} ] +-E 6fc ( afc+Gfc - i ) +— E afcKfc+n_la «^ 
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Proof. See Section 6.3 in the Appendix. 



□ 



5.2. Proof of Theorem 2.1. The idea is to show that r 2 (/i) behaves asymptotically like n~ l Ylk=i ^ 
And since the partial sum n -1 / 2 ^]J =1 h(Xk) behaves like « _1 ^ 2 X^fc=i &k as shown in Lemma 5.1, 
it would follow that T n behaves asymptotically like J2k=i Afc/ ^Ylk=i ®\ which satisfies a CLT as 
recalled in Theorem 6.1 of the Appendix. 

As a matter of re-arranging the summations, we can rewrite Y 2 n (h) as follows 



w 



k=l j=l 



k-j 



h(Xj)h(X k ) 



2 

n 



\k=l 



\k=l 



Un 

n 



X>™ ( 24 ) 



\k=i 



where v n (k) d = n 1 Yle=i w (^T^) ' anc ^ Un ^ n 2 S"=i Sfe=i w((£ — k)/c n ). Under A3, it is easy 
to check that v n (k) and u n satisfy 

v n (k) = O > v n (k) — v n (k — 1) = O ^ , and u n = O > (uniformly in A;). 

Therefore, using (23) and (7), we can write (24) as 



1 



w 



k-j 



HXjfhiX^ + gW, 



(25) 



k=l j=l 

where E 1 / p (|^ 1 ^| p ) < c n /n. Combined with Lemma 5.2, it follows that 

n k—1 



k=l k=2 j=l 



W 



k-j 



(26) 



where EVp(|^ 2 ) \p) < r ( n ] + r^ 2) + c n /n. The term J2k=2 D k w({k - j)/c n )D j is a martingale 

array and Burkeholder inequality (21) (applied twice) yields 



n 



n k—1 

EE 

k=2 j=l 



k-j 



D k D, 



n,k 



pA2 



1 

pA2 



(27) 



k=2 



We then obtain that 



(28) 



where 



JE 1/P (l^ 3) | p )<r«+^ 



k=i 



W + ^ + n 
n 



pA2 



E 



pA2 



l 

pA2 



0(1), 
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by assumption. A2 implies that n 1 Ej=i converges almost surely to <j 2 , and we conclude that 
converges in probability to <r 2 . It remains to deal with T n . We have 

E£=i D k , pp(i) 

V n E£=l D fc 

Under A2, n _1 Efc=i -^fe converges in probability to cr 2 that is positive almost surely, and by the 
martingale weak invariance principle (see Theorem 6.1 and the following remark), we have that 
Efc=i D k/\Jj2l=i D l ^ N (°> !)■ Jt follows that T n 4 N(0, 1), and the theorem is proved. 

5.3. Proof of Theorem 2.2. The idea of the proof is that for large n, and for c n = n, r 2 (/i) be- 

haves like £, e i a t ELi <f>j (I) V)', and that ^ ELi ^' (I) Afe behaves like £ ^-(t)d£(t) ~ 
N(0, 1). To carry the details, we start again from (24) which gives 

^^^(v) h(x )h(x k )- 2 - (x>(x*)l (x>(*0&to)+^ (X>to1 

fc=lj=l ^ ' \fc=l / \fe=l / \fe=l / 

With c n = n, v n (k) = n~ x Yle=i w (^r) * s ^ n e right-Riemann sum approximation of g(kn~ 1 ), 
where g(t) = w(t — u)du. Thus with the smoothness of w, 

\v n {k)-g{kn- 1 )\=o(^j , \v n (k)-g{kn- l )-v n {k-l) + g({k-l)n- l )\ = ° Qa) > 



and 



5 (i)cft 
o 



O ( — ) , (uniformly in k). 



By combining this with the linear and quadratic martingale approximation (Lemma 5.1, Lemma 
5.2), we obtain that 



= I E E {» (^) - « (*) - *(£) + i *>*} D ^ + «i 

-sEE*^)^^' (29) 



k=l j=l 



where 



^(|^r)<r«+r( 2 ) + i. 

It is assumed that the kernel : [0, 1] x [0, 1] — > R with p*(s, i) = w(s — t) — g(t) — g(s) + j^ 1 g(u)du 
is twice continuously differentiable and positive definite. By Mercer's theorem, there exist positive 
eigenvalues {aj, j G 1} and orthonormal eigenfunctions {<f)j, j G 1} 0j G L 2 ([0, 1]) such that 

p*(s,t) = '^2a j <f> j (8)<f> j (t). (30) 
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Notice that Jq p*(s, t)dt = which means that is also eigenvalue of p* with associated eigenfunc- 
tion 4>q (4>o(t) = 1). Using the expansion (30), we conclude that 



j<=\ \ v k=\ 



x 2 



"3 



where £>&^ is as in (29). It follows that 



j yg SE=i *(**) ELi *(**) y^E^i 

\ * ■* / * I "j . /v^ n2 



We shall write T = f {0} U I. Consider the Hilbert space ^ 2 (a) = f {x £ i' : SfceT a & x i < °°}' where 
«o = 1) equipped with the norm ||x||2 = \/Ylj£\ a j x ] an d the inner product (x, y) = Y2j&\ a j x jyj- 

def J Efc=i0j(t)^fe • _t\ • „ -,2/ 



The random variable ^ n = I — j G I > is a ^ (a)-valued random process. We prove in 
Lemma 5.3 below that ^ n \E', where 

and {-B(t), < t < 1} is the standard Brownian motion. The theorem follows from the continuous 
mapping theorem, since {J^ 4>j(t)dB(t), j G T} are i.i.d. N(0, 1). 

Lemma 5.3. Under the assumptions of Theorem 2.2, \l/ n ^> as n — > oo. 

Proof. We set the notation 5* = Ylj=i D 3 ( s o = 0), U% = Y$=i D ji ( U o = 0), and s k = 
Ej=i E ( £) |) ( s o = 0). Consider ( n : [0,1] ->• R, the C[0, l]-valued process obtained by inter- 
polating the points (f£, f ) , . . . , (ft, f ) , . . . , (l, ^) . That is 

C B (t) = ft (jwt J ^=1, for ^i< t <^. 

t^n ^fc s k—l U n S n S n 

Then we have 



/ ^(t)dCn(t) = £ — ^ L" — = E<Mi) 



Now, conveniently, we introduce the integration map M : C[0, 1] — >■ £ 2 (a) defined as M(h) = 
{Jo 4>j(t)dh(t), j G T}. These integrals are well-defined since the functions <£j are continuously 
differentiable (Theorem 6.2 (ii)). Furthermore, the inegration by part formula gives Jq <pj(t)dh(t) = 
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<j)j(l)h(\) - <f>j(0)h{0) - /J h^'^dt. Therefore, for h, h £ C[0, 1], 



\\M(h)-M(h )\\l = J2*j 



i ,1 
(f)jdh — I c/)jdho 

v Jo 



2 



< mh-hoWtc sup VQ j |^(t)| 2 + 2||/i-/i ||Ly"ai [ Wj} 2 {t)dt < co\\h- ML 

0<t<l^ -r Jo 

where the last inequality uses (36-37). This establishes that A4 in fact takes values in £ 2 (a) and is 
Lipschitz. Now, it is clear that we can write 

where the j-th component of e n is 



i3 ) ' U„ 



With A1-A2, we have the weak convergence of ( n towards the standard Brownian motion (see 
Remark 8), and by the continuous mapping theorem, M(( n ) — > The lemma is proved by 
showing that e n converges in probability to zero in £ 2 (a). 

Negligibility of e n . 

_ „ Ej 6 r a i {ELi ( 6 n,k(j) - 4>j (f )) 

IMI 2 = 2^ a j e nU) = ^2 • 

je \ n l^k=l k 

Since n _1 53fc=i ®\ ^ as a positive limit almost surely, it is enough to show that the numerator 
converges to zero in probability. Towards that end, we have 



E (5> it (;)) %}) = iE E (°SE^ 

\ iFl kfc— 1 / / fc— 1 



k^ 2 



n 



J<=\ Kk=l ) I k=l jg| 

For any arbitrary continuously differentiable function / : [0, 1] — >■ R, w € [0, 1], and < a < b < 1, 
the following bound holds true: 

2 



— f f(t)dt-f(w)) <(b-a) f\f'(u)} 2 du. 

- a J a / JO 

£fe 

With 5 Ht k(j) = Sk -s k l I a T-i <f>j(t)dt, we use the above inequality to write 

E (^') - ^ (£) ) 2 < ^f^ 1 E ^ jf W» 2 * z 



We conclude that 
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asn->oo, since J2k=i ^(lAs| 2p ) = o{n p ). This completes the proof. 



□ 



5.4. Proof theorem 4.1. The idea is to use the decomposition (25), together with Lemma 5.2 
and a more careful bound on the term in Lemma 5.2. The main ingredient of the proof is again 
martingale approximation. We recall that cr 2 (h) = 1 and 7r(h) = 0. Since P no longer depend on 
6, we write g instead g$, Pg instead Pege etc... We gather from (25) in the proof of Theorem 2.1, 
and Lemma 5.2, that 

rl(h) - 1 = It (Dl - 1) + £l>X> ( k -^A Dj + tf + e « 

l 1 I 1 A — l \ n ' 



k=l k=2 j=l 



where 



and 



n k-2 

e« = Pg(Xk-i) K(* - j) - w n (k - 1 - j)) h(Xj), 

k=3 j=l 

Using (27), if follows that T 2 n (h) -1 = 1 ££ =1 " l) + + ^ 2) , where E 1 /^^) < _^ + 

Cn I / <^ / Cra 

n V n ~ V n 

We know from Lemma 5.2 that E 1 / 2 (|en \ 2 ) < r^ 1 - 1 < -h==. But under the current assumptions, 
it is possible to obtain a better bound. We further use the Poisson equation approach to write 
Pg{x) = U(x) — PU(x), where U(x) d = J2j>o P^ +1 g{x). By the assumption (GE), and since 
g € C v s, U is well defined. Set B n ^ 2 *= Z^=i {w n (k - j) - w n (k - 1 - j)) h(Xj), and S n j = f 
w n (k — j) — w n (k — 1 — j). By A3, and using the mean value theorem, we get \S n j\ < and 
\$n,j — 8n,j-i\ < ~2" j uniformly in j. Then using (23), it comes that 

E 1/4 (|i?„, fc -2| 4 ) < ~^=, and E 1 /* (|£ n , fe _ 2 - S^sl 4 ) < — ■ 



The second inequality follows from the bounds on |<5 nj -| and \S n j — S n j-i\, and the fact that 



fc-3 

B n ,k-2 - #n,fc-3 = (W n (2) - W n (l)) h(X k _ 2 ) + ^ {$n,j ~ S n ,j-l) K X 3, 

3=1 

Now, from Pg(x) = U(x) — PU(x), we get 



4P = E ( u ( x k-i) - PU(X^i)) B n>k _ 2 = £ (l/(X fc _i) - PU(X k . 2 )) 5 n , fe _ 2 

fc=3 fe=3 

n 

+ ^P£/(-Xfc_ 2 ) {B n ,k-2 - B ni k-3) - PU(X n -i)B n , n -2. 
k=3 

Noticing that X)fc=3 (^(-^fc-i) ~~ PU{X k - 2 )) B n ^- 2 is a martingale array, we deduce easily from 
the above that 

& ,2 (\®?)<-. 
\ / c n 
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Under assumption (GE), and since 5 G (0,1/4), we have E 1 / 2 (££=i {D 2 k - l)f 
conclude that 



< Jn. We 



E l/2 



(r^)-i)' 



1 1 Cn , 1 

< 1 1- 4 /— < 1- 

ft Cn y 71 Cn 



k j_ 
n n 



D k D r 



which ends the proof. 
5.5. Proof theorem 4.2. We define 

1 n n 
k=l j=l 

We recall from (29) that 

where E 1 / 2 (j£rt^| 2 ^ < + + n~ x < rT l l 2 (see the proof of Theorem 3.1 for the bound on 
+ Vn ^). This implies that 



di(r^),x 2 )<d! (r 2 , x 2 ) + 



n 



(31) 



Therefore we only need to focus on the term di (f 2 , x 2 )- 

On the Euclidean space R 1 , we shall use the norms ||x|| 2 = Yl%e\ a i x h IMP = x i an d the 
inner-products (x, y) a = Ylie\ a i x iUii an d ( x > v) = Ylie\ x iUi- For a sequence (ai, 02, . . .), we use the 
notation aj : fc = (a^, . . . , a&) (and is the empty set if i > k). We introduce new random variables 

{Zij, i G I, 1 < j < n} which are i.i.d. N(0, 1), and set S £:fc d = (yj^ Zij,---, Ej =i Z \j) G ^ 
so that 



2 dist. 



J'=l 



n 



<S II 2 



- , * € 1, ^ < j < fc. 

n 



For 1 < £ < k < n, and omitting the dependence on n, we set as the M lx ( fc e+1 ^ matrix 

By the Mercer's expansion for p+, we have 

( I n 



fc=i 



1 



Bl :n -Dl :n 



For / G Lip 1 (R), we introduce the function / Q : Rl'l — > R, defined as / a (x) = /(||x|| 2 ). As a 
matter of telescoping the sums, we have 



E[/(f 2 )-/( X 2 )] =E 



fa ( — 7=Bi :rl Z)i :n ) - /a 



n 



n 



>l:n 



1 



n 



fa —7='&l:eDl:e + -^=S^ + i :n — f a [ -^=Bi : ^_iI?i : ^_i H — ^S^ : „ 



1 



1 



n 



£=1 



fa,n,e+l —j=R\:l-\E>\:l-\ + -^B«Z)^ - f a ,n,e+l —p^l-.i-lDl-.t-l + —j=Sl-.t 



n 



n 
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where we define 



fa,n,t( X ) = E 



fa ( X + ^S £: 



n 



, and set / a ,n,n+i(x) = f a (x). 



We deal with the case I = n separately. Indeed, it is easy to check using the Lipschitz property of 
/ that 



E 



fa ( — ^Bi :n _iZ?i :n _i H — — B n:n D n 
n Jn 



fa 



Bi :n _iDi :n _i H — ^S n:n 
n Jn 



< E 



1 



1 



1 



1 



^Bi :n _iDi :n _i H — — B n:n D n || — || -pBi :n _iDi :n _i H — ^S n:n || Q 
\M V" V n v n 



< J_ 
~ Vn' 



For the rest of the proof, we assume 1 < i < n — 1. First, we claim that / Q>n ^ is differentiable 
everywhere on M}. To prove this, it suffices to obtain the almost everywhere differentiability of 



z G M 1 i->- / a (x + z) for any x G 



By Rademacher's theorem, / as a Lipschitz function is 



differentiable almost everywhere on M. If E is the set of points where / is not differentiable, we 
conclude that f a is differentiable at all points z ^ {z € IR 1 : ||x + z|| 2 G E}. Now by Ponomarev 
(1987) Theorem 2, the Lebesgue measure of the set {z £ I 1 : ||x + z\\ 2 a £ E} is zero, which proves 
the claim. 

As a result, the function x i— >■ f a) n,i{x) is differentiable with derivative 



v/ Q , n/ (x) ■h = m 



fa [X 



1 



Sf-r 



X + 



1 



h 



>n J \ yn 
By writing his expectation wrt the distribution of x + -^=St n , we get 

Vf a , n/ (x)-h = 2 J f' a (z) (z,h) a ex V (- — - - (||x|| 2 - 2 (x, z))) fi n , e (dz), 

where fi n £ is the distribution of -4=S£ :n . This implies that f an e is infinitely differentiable with 
second and third derivatives given by 



V^/a,n,*(x)-(/n,M 



n 



y /aW ^l)a fa - Z, h 2 ) exp ^- 



n 



= 2E 



X 



n 



+ 



2{n-£+ 1) 



2 (x,z)) ) Hn,t{dz) 



, /il 



n-^+1 ^ - I + 1 ' / a \y/n-£+l 
which implies after soe easy calculations that for li£l', 

V< 2 >/ a>M 0r)-(M)|< 



1 + 



n-£ + l' 



(32) 



Similarly, 



V^/a,n^(x)-(/ il ,/ i2 ,/ l3 )=2 



n 



n-t + l 



-E 



f'a[ x + ~l=Skn ) ( XW ^— T + , ■ = , /»1 

'n J \ V n — i+\ y/n-£ + l 



x {h 2 ,h 3 ) 



Vn-£+l 



,tl2 



+ 



, h 3 



-J 1 117,113 

n-e + i 



1 + 



n 



n-e + r 
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and for 

V^f a ,nAx)-(h,h,h) 

Then by Taylor expansion we have 

fa,nl+l ( —j=R\:t-\E>\:l-\ H l^l-.l^l ) _ fa,n,i+l ( —j=R\:t-\D\:l-\ H 7=S^ ) 

\V n V n / \V n V n J 

FV/ a ,n,w(~r8l:<-l°l:H) ' ( B «A ~ S«) 

n vn 



where, using (33), 
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(33) 



+ ^V^/^+^B^Z^!) • [(B^,B^) - (S«,S«)] + e <g, 



o (3) 



< 



n 



It follows that 



n— 1 /t ii 

E E 0S|) ^ _1 E-^ + -" /2 E 7 s ^ 1/2 lo ^) 



-3/2 ! + 



t-l 

n-e + i 



Bl^_lDl:£_l 



|B^.D^||a + ||S^|||) . 



(34) 



By first conditioning on Tt-\, we have 



E 



V fa,n,e+i(—/=Ri:e-iDi : e-i) ■ (BteDe - Seu) 
v ^ 



Writing d = 5V (2) / a ,„^(^Bi^_i£)i^_i), we have 

V^/a,Ti^+l(-/=Bl:^-l-Dl:^-l) ' Ph^)B< : A) - (S^, S^)] 



n 



K nj e(i,j)Z it eZj£ 



Therefore, 



E (VW /^(-^B^^^O • [(B^.B^) - (S«,S«)] |^) - 

E^ (£) ^ (^) t E (^ 2 i^-i) - <y 



= E 



K n , e+1 (i,j) [E (D, 2 |^-i)-l]+E' 



7) 



n 



^71,^+1 (*,j)> 



where <5y = 1 if z = j and zero otherwise. We claim that the proof will be finished if we show that 
for all i,j e I, and 1 < £ < n, 

// 



E l/2 



(K n/ (i,j) - K n/+1 (i,j)Y 



~n-£ + l' 



(35) 
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To prove this claim, it suffice to use (35) to show that In 1 Y^=i 4>i (!) 4>j (|) E (K n j + i(i, < 
n-VajogCn) for j, and {n' 1 ££ =1 ^ (!) & (n) K n,e+i(hj) [E (^fl^-ii |- l] | < n^k^C 
for all i, j G I. To show this, write 



n-l 



n 



E 



(^ + i(M')) = 

I €=1 

n-i r e-i 

+ E E 



n * — ' n 

L fe=i 



n 



n 



E(K n , n (i,j)) 



[E{K n/ (i,j)-K n/+1 (i,j))] 



By the convergence of Riemann sums, |^X)?=i^i (!) *A? (!) ~ n Combined with (32) and 
(35), this implies that 



1 n 

E- 



K(K n/+1 (i,j)) 



< 



1 



n 



fe=i 



< 



log(n) 



n 



For the second term, let £/ denotes the Poisson equation solution associated to E (Dg\Fe-i) — 1, 
so that we have almost surely 

U(X^i) - PU(X e -!) = E - 1. 

Therefore, by the usual martingale approximation trick 

n— 1 



n 



E(K n ,, +1 (z,j) [E (Z>?|Ji_i) - 1]) = -fr (-) <j>j (-)E(K n , 2 (i,j)U(X )) 



n— 1 



n 



K n ,e+i(i,j) - 4>i 



£-1 



n 



n 



K n/ (i,j)\U(X^i 



We now use the fact that <pi<f)j is of class C 1 (see Theorem 6.2 (ii)), (32), and (35) to conclude that 



n-l 



log(n) 



£ 7^ + n E El/2 (K,m(*,i) " *W + 2(i,j)| 2 ) < 

This proves the claim. It remains to establish (35). Write E^ to denote the expectation operator 
wrt n~ l / 2 Sp :n . We then have for any hi, h 2 € R 1 , 

2K n/ ■ (hi, h 2 ) = V^/<W (-^-B^D^-^ ■ (hi, h 2 ) 
f 

J a 

,n,£+l 



n 

n-e + i 

n — 



1 j-. S^- n 

Bl:«-l-Dl^-l H t=- 

n vn 



n-i + 1 



n vn 



n 



Bi^_iL>i : £_i + —= ) ■ (hi, h 2 ) 
n Jn ' 



n-i + 1 



o 
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Therefore 



2 (K n/ - K n/+ i) ■ (hi, h 2 ) = 
V (2) f a ,n,e+i ^Bi :< _iD w _i + -^0 -(hi,h 2 )-V^f aM ^B w _iDi :< _i + ^0 -(/n,^) 

. v*> W+1 (^B 1:( . lDl ,_, + « * + (! - t )^f) ■ (fc.fc, * - 3f ) 

for some i G (0, 1). Using (32) and (33), (35) follows from the above. 

6. Appendix 

6.1. A weak invariance principle for martingales. We recall a martingale weak invariance 
principle from Hall and Heyde (1980) Theorem 4.2. Let {D n , T n , n > 0} be a martingale difference 
sequence. Set S k = E*=i^i, U% = £j =1 Z>?, and s\ = £j =1 E(I>?) (5 = 0, f/ 2 = 0, s 2 , = 
0). Consider Q n : [0,1] — >■ R, the C[0, 1] -valued process obtained by interpolating the points 
'•••'(!'&)'■■•' (Lfe). where C/ fc = v ^. 

Theorem 6.1. As n — > oo, suppose that 

(1) s 2 — > oo, and % converges almost surely to a random variable that is positive almost surely. 

(2) For all e > 0, s~ 2 £?=i E (d? l { \ D . >e8n} ) 0. 



Then ( n , as a random process in C[0, 1] (equipped with the uniform norm), converges weakly to the 
standard Brownian motion. 

Remark 8. Assumptions A1-A2 imply that the conclusion of Theorem 6.1 holds for the martingale 
difference {-Dfc, 1 < k < n} of Section 2. To see this, notice that for p > 1 as in Al and for any 
M > 0, 



i n 

n ' 



k=l / 

Therefore, under Al, n -1 ?/ 2 is uniformly integrable. Now, by A2, n^ 1 U 2 converges almost surely 
to cr 2 . We conclude that n" 1 s„ = n _1 E(lT 2 ) converges to E(cr 2 ) > 0, so that s n — > oo, and U%/s n 
converges to (T 2 /E((T 2 ). Thus part (1) of the theorem holds. For all e > 0, 



n _. n _. n 

*n 2 E E (^l(l^l>«-}) s j» E «? s ^ E «?> 

thus part (2) also holds under A2. 



j=l ' S n j= i - i= i 
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6.2. Mercer's Theorem. We recall Mercer's theorem below. Part (i) is the standard Mercer's 
theorem, and part (ii) is a special case of a result due to T. Kadota (Kadota (1967)). 

Theorem 6.2 (Mercer's Theorem). (i): Let k : [0,1] x [0,1] — > M be a continuous positive 
semi-definite kernel. Then there exist nonnegative numbers {Xj, j > 0} ; and orthonormal 
functions {<pj, j > 0}, 4>j € L 2 ([0, 1]), such that k(x,y)4>j(y)dy = \j(j)j(x) for all x € 
[0,1], j > 0, and 



lim sup 

n ^°° x,ye[0,l] 



3=0 



Furthermore, if Xj / 0, is continuous. 
(ii): Let k as above. If in addition k is of class C 2 on [0, 1] x [0, 1], then for Xj / 0, <pj is of 
class C 1 on [0, 1] and 



lim sup 

x,i/G[0,l] 



d 



o. 



By setting x = y, in both expansions, it follows that 



sup Xj\(pj(x)\ 2 < sup k(x,x)<oo. 



0<x<l 



i>o 



0<x<l 



and 



sup 5^Aj|^-(x)| 2 < sup 



0<x<l 



i>o 



0<x<l 



d 2 

dudv 



k{u, v)\ u=x>v= 



< oo. 



(36) 



(37) 



6.3. Proof of Lemma 5.2. 

Proof. Set w n (0) = 1/n, and for k > integer, w n (k) = 2n~ l w{k/c n ). Then we can rewrite Q n as 

n k 

<5" = E E ^ " 3)h{X k )HXj). 
k=i j=i 

Using the Poisson's equation /i(x) = ge(x)—Pe9e(x), it holds almost surely that h(X k ) = ge k _ 1 (X k ) — 
At-i^-iP^O- Therefore 

h{X k )h{Xj) - DkDj = (h(X k ) - D k ) h{Xj) + D k {h{Xj) - Dj) 

= (iViS* fc -i(**-i) - Pe^QOH-AXk)) HXj) 

+ D k (P e ._ l90 ._ 1 (X j . 1 ) - P e (Xj)) . 

Then setting e n = Q n - J2k=i Ylj=i ™n{k - j)D k Dj, we obtain 

n k 

€n = J2J2^n(k-j) (iV 1 <fo fc _ 1 (-X fc _i) - Pe^ge^iXk)) h(Xj) 
k=i j=i 

n k 

+EE^(*-^( p «H»i-i(^i)-ww) = ^ + si 2) ' ( 38 ) 

fc=l j=l 
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We start with the second term on the right hand side of (38) that yields after some rearrangements 

n k 

k=i j=i 

n I k-1 

= E Dk \ E " i) ~ ^ k ~ i + *)) p %-i^-! (Xj-i] 

k=2 \j=l 
n / k-1 

+ E D A E ^ " J) ( P %^ PO) - Pe^ge^ {Xj)) 

k=2 \j=l 



+ Pe ge Q (X )Y, lI, n(k)D k + (w n (0) - w n (l)) j^DfcP^s^pffc-i) 

fc=i 

n 

5„(0) £iVVi0* fc -i(**) = T« + TP, 



fc=i fc=i 



fc=i 

where T^ 2) = -^(Oj^i^fe-iS^-ift)- We deal with T^ 2) below. Notice that has 
the general form Ylk=i B n ,kDk, where B n ^ is J^-i-measurable. Thus by Burkeholder's inequality 
applied to the martingale X^fe=i Pn.fc-Dfcj we derive after some calculations that 

■ pA2 



T (l) 



; 2 < e ii^ir * e ii^s 2 < A2 ^e{^(i+ F 



fe=i fc=i fc=i 

The first term on the right hand side of (38) gives 

n k 

= E (*Vi0* fc -i(**-i) - Po^get-AXk)) E^( fc - 

fc=i i=i 

fc=i i=i 

n A; 

+ E ( p <fc0**(**) - ^ fc -i^*-i(^)) 2>n(* - 
fc=i i=i 

Then we write 

n k 

E (JVis«*-i(*k-i) - p^ fc m) E*«( fc - 

fc=i i=i 

n f k k—1 

= E s E - - E ^ - 1 - 3)KXj) 

k=l [j=l j=l 

n 

-P en g e jX n )Y,™n(n-3)h(Xj)- 

3=1 

This implies that 
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k-2 



k=3 



fc 



fc=i i=i i=i 

and 

n n 
R { n ] =W n {Q)Y J P 8 k -,9e k AXk-l)KXk) + {w n {l)-W n m 



k=l 



k=2 



We gather these terms together and rewrite (38) as 

e n = eU+TW + RW + TW + RW. 

Using (23) we get: 

k 



(39) 



4 1} <E 6 * 



fc=i 



E^n(fc-i)M^) 



+ On 



2p 



E^n(n-jX^j) 



J'=l 



2,; 



n 



< 77, 



k=i 



With the same technique we get 



fc=2 



Remark 9. In fact, as we shall show later, with additional assumptions, the term has better 
convergence rate than shown above. See the proof of Theorem 4.1 in Section 5.4. 

The last two terms in (39) are 



T (2) + R m = _^ n(0) £ D^e^ge^ (X k ) + u) n (0) E iVii»*-i M*k) 

fe=i 

n 

+ («) n (l) - w n (0)) E Pe^ge^ (40) 



fc=i 



fc=2 



Replacing h(Xk) by gg kl (X k ) — Pg kl gg kl (X k ), the first and third terms on the right hand side 
of (40) gives after some easy re-arrangements 



n— 1 



fc=l 



- ^(Olp,,,^^)^ + (tD n (l) - w n (0)) E P9 k 9e k {X k )h{X k ) 

k=l 

n-1 

= (w n (l) - w n (Q))Y,Pe k ge k (X k ) {Pe^ge^AXk-i) - Pe k ge k {X k )) + q%\ 



k=i 



ADAPTIVE MARKOV CHAIN MONTE CARLO CONFIDENCE INTERVALS 



29 



where E^Pfl K n + n 1 c n 1 Y2k=i a k^k + n 1 Ylk=i &fc( a fc + Kfe)- The second term on 

the right-hand side of (40) gives 

n n 

w n (0)Y J P o k -,9e k ^ 1 (Xk-i)HX k ) = i5„(O)£iVi0* fc _i(**-i) (iVi0**-i(**-i) " ^</0 fc TO) 



fc=i 



fc=i 



where E 1 /* (|^ 4) |) < n" 1 (ELi(«fe-i^) pA2 ) ** 2 + n" 1 ELi Ofc-i&fc- Therefore 
T( 2 ) + 4 2 ) = *>(?) + <#) + ^(0)(P, ^ (X )) 2 - ? 2}„(0)P, n5 , n (X n )P, n _ 1 ^ n _ 1 (X n _ 1 ) 

n— 1 n— 1 

+ K(l) - 2w n (0)) £ Pe.ge^Pe^ge^Xk-i) + (2«) n (0) - «) n (l)) £ (P^ fc (X fe )) 2 , 

and 



fc=i 



fc=i 



T( 2 ) + 7?( 2 ) 



< 



,(3) 



+ 



1 1 

+ - (oq + a n a„_i) H y~] a k (a k + a fc _i) 

p n nc n f— f 



fc=i 



We obtain the lemma by putting all the remainders together. 



□ 
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